Abstract

The need to generate new views of a 3D object from a single real image arises in several fields, including
graphics and object recognition. While the traditional approach relies on the use of 3D models, we have
recently introduced [11, 6, 5] techniques that are applicable under restricted conditions but simpler. The
approach exploits image transformations that are specific to the relevant object class and learnable from
example views of other "prototypical" objects of the same class.

In this paper, we introduce a new technique of this kind by extending the notion of linear class first proposed
by Poggio and Vetter [12]. For linear object classes it is shown that linear transformations can be learned
exactly from a basis set of 2D prototypical views. We demonstrate the approach on artificial objects and
then show preliminary evidence that the technique can effectively "rotate" high-resolution face images
from a single 2D view.
1  Introduction

View-based approaches to 3D object recognition and graphics may avoid the explicit use of
3D models by exploiting the memory of several views of the object and the ability to
interpolate or generalize among them. In many situations however a sufficient number of
views may not be available. In an extreme case we may have to do with only one real view.
Consider for instance the problem of recognizing a specific human face under a different
pose or expression when only one example picture is given. Our visual system is certainly
able to perform this task even if at performance levels that are likely to be lower than
expected from our introspection [10, 15]. The obvious explanation is that we exploit prior
information about how face images transform, learned through extensive experience with
other faces. Thus the key idea (see [12]) is to learn class-specific image-plane
transformations from examples of objects of the same class and then to apply them to the
real image of the new object in order to synthesize virtual views that can be used as
additional examples in a view-based object recognition or graphic system. Prior knowledge
about a class of objects may be known in terms of invariance properties. Poggio and
Vetter [12] examined in particular the case of bilateral symmetry of certain 3D objects,
such as faces. Prior information about bilateral symmetry allows the synthesis of new
virtual views from a single real one, thereby simplifying the task of generalization in
recognition of the new object under different poses. Bilateral symmetry has been used in
face recognition systems [5] and psychophysical evidence supports its use by the human
visual system [15, 13, 18].

A more flexible way to acquire information about how images of objects of a certain class
change under pose, illumination and other transformations is to learn the possible pattern
of variabilities and class-specific deformations from a representative training set of
views of generic or prototypical objects of the same class - such as other faces. Although
our approach originates from the proposal of Poggio and Brunelli [11] and of Poggio and
Vetter [12], for countering the curse-of-dimensionality in applications of supervised
learning techniques, similar approaches with different motivations have been used in
several different fields. In computer graphics, actor-based animation has been used to
generate sequences of views of a character by warping an available sequence of a similar
character. In computer vision the approach closest to the first part of ours is the active
shape models of Cootes, Taylor, Cooper and Graham ild]. They build flexible models of
known rigid objects by linear combination of labeled examples for the task of image
search - recognition and localization. In all of these approaches the underlying
representation of images of the new object is in terms of linear combinations of the shape
of examples of representative other objects. Beymer, Shashua and Poggio [6] as well as
Beymer and Poggio [5] have developed and demonstrated a more powerful version of this
approach based on non-linear learning networks for generating new grey-level images of the
same object or of objects of a known class. Beymer and Poggio [5] also demonstrated that
new textures of an object can be generated by linear combinations of textures of different
objects. In this paper, we introduce and extend the technique of linear classes to
generate new views of an object. The technique is similar to the approach of [5, 6] but
more powerful since it relies less on correspondence between prototypical examples and the
new image.

The work described in this paper is based on the idea of linear object classes. These are
3D objects whose 3D shape can be represented as a linear combination of a sufficiently
small number of prototypical objects. Linear object classes have the property that new
orthographic views of any object of the class under uniform affine 3D transformations, and
in particular rigid transformations in 3D, can be generated exactly if the corresponding
transformed views are known for the set of prototypes. Thus if the training set consists
of frontal and rotated views of a set of prototype faces, any rotated view of a new face
can be generated from a single frontal view - provided that the linear class assumption
holds. In this paper, we show that the technique, first introduced for shape-only objects,
can be extended to their grey-level or colour values as well, which we call texture.

Key to our approach is a representation of an object view in terms of a shape vector and a
texture vector (see also Jones and Poggio [9] and Beymer and Poggio [5]). The first gives
the image-plane coordinates of feature points of the object surface; the second provides
their colour or grey-level. On the image plane the shape vector reflects geometric
transformation in the image due to a change in viewpoint, whereas the texture vector
captures photometric effects, often also due to viewpoint changes.

For linear object classes the new image of an object of the class is analyzed in terms of
shape and texture vectors of prototype objects in the same pose. This requires
correspondence to be established between all feature points of the prototype images - both
frontal and rotated - which can be done in an off-line stage and does not need to be
automatic. It also requires correspondence between the new image and one of the prototypes
in the same pose but does not need correspondence between different poses as in the
parallel deformation technique of Poggio and Brunelli [11] and Beymer et al. [6].

The paper is organized as follows. The next section formally introduces linear object
classes, first for objects defined only through their shape vector. Later in the section
we extend the technique to objects with textures and characterize the surface reflectance
models for which our linear class approach is valid. Section 3 describes an implementation
of the technique for synthetic objects for which the linear class assumption is satisfied
by construction. In the last section we address the key question of whether the assumption
is a sufficiently good approximation for real objects. We consider images of faces and
demonstrate promising results that indirectly support the conjecture that faces are a
linear class at least to a first approximation. The discussion reviews the main features
of the technique and its future extensions.


2  Linear  Object  Classes 

Three-dimensional objects differ in shape as well as in texture. In the following we will
derive an object representation consisting of a separate texture vector and a 2D-shape
vector, each one with components referring to the same feature points, usually pixels.
Assuming correspondence, we will represent an image as follows: we code its 2D-shape as
the deformation field of selected feature points - in the limit pixels - from a reference
image which serves as the origin of our coordinate system. The texture is coded as the
intensity map of the image with feature points, e.g. pixels, set in correspondence with
the reference image. Thus each component of the shape and the texture vector refers to the
same feature point, e.g. pixel. In this setting 2D-shape and texture can be treated
separately. We will derive the necessary and sufficient conditions for a set of objects to
be a linear object class.

2.1  Shape  of  3D  objects 

Consider a 3D view of a three-dimensional object, which is defined in terms of pointwise
features [12]. A 3D view can be represented by a vector
$X = (x_1, y_1, z_1, x_2, \ldots, y_n, z_n)^T$, that is by the x,y,z-coordinates of its n
feature points. Assume that $X \in \mathbb{R}^{3n}$ is the linear combination of q 3D views
$X_i$ of other objects of the same dimensionality, such that:

$$X = \sum_{i=1}^{q} \alpha_i X_i. \qquad (1)$$

X is then the linear combination of q vectors in a 3n-dimensional space, each vector
representing an object of n pointwise features. Consider now the linear operator L
associated with a desired uniform transformation such as for instance a specific rotation
in 3D. Let us define $X^r = LX$ as the rotated 3D view of object X. Because of the
linearity of the group of uniform linear transformations L, it follows that

$$X^r = \sum_{i=1}^{q} \alpha_i X_i^r. \qquad (2)$$

Thus, if a 3D view of an object can be represented as the weighted sum of views of other
objects, its rotated view is a linear combination of the rotated views of the other
objects with the same weights. Of course for an arbitrary 2D view that is a projection of
a 3D view, a decomposition like (1) does not in general imply a decomposition of the
rotated 2D views (it is a necessary but not a sufficient condition).
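
Equations (1) and (2) can be checked numerically. The following is a minimal sketch, not part of the implementation reported in this paper: the random prototype views, the weights and the rotation angle are illustrative assumptions, and the operator L is built as a block-diagonal matrix that applies the same 3x3 rotation to every feature point.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 5, 3                      # n feature points, q prototype objects (assumed sizes)

# Prototype 3D views X_i as columns of a (3n x q) matrix, and weights alpha_i.
prototypes = rng.normal(size=(3 * n, q))
alpha = np.array([0.5, -1.2, 0.7])
X = prototypes @ alpha           # equation (1)

# Uniform transformation L: the same rotation about the y-axis for every point.
theta = np.deg2rad(22.5)
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
L = np.kron(np.eye(n), R)        # block-diagonal (3n x 3n) operator

X_r = L @ X                      # rotated view of the new object
prototypes_r = L @ prototypes    # rotated views of the prototypes

# Equation (2): the same weights reproduce the rotated view.
print(np.allclose(X_r, prototypes_r @ alpha))
```

The check only exercises the 3D statement; whether the weights can be recovered from 2D projections is the subject of the definition that follows.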

2D  projections  of  3D  objects

The question we want to answer here is, "Under which conditions do the 2D projections of
3D objects satisfy equations (1) and (2)?" The answer will clearly depend on the types of
objects we use and also on the projections we allow. We define:

A set of 3D views (of objects) $\{X_i\}$ is a linear object class under a linear
projection P if $\dim\{X_i\} = \dim\{PX_i\}$ with $X_i \in \mathbb{R}^{3n}$ and
$PX_i \in \mathbb{R}^{p}$ and $p < 3n$.

Figure 1: Learning an image transformation according to a rotation of three-dimensional
cuboids from one orientation (upper row) to a new orientation (lower row). The 'test'
cuboid (upper row right) can be represented as a linear combination of the two-dimensional
coordinates of the three example cuboids in the upper row. The linear combination of the
three example views in the lower row, using the coefficients evaluated in the upper row,
results in the correct transformed view of the test cuboid as output (lower row right).
Notice that correspondence between views in the two different orientations is not needed
and different points of the object may be occluded in the different orientations.

This is equivalent to saying that the minimal number of basis objects necessary to
represent an object is not allowed to change under the projection. Note that the linear
projection P is not restricted to projections from 3D to 2D, but may also "drop" occluded
points. Now assume $x = PX$ and $x_i = PX_i$ to be the projections of elements of a linear
object class with

$$x = \sum_{i=1}^{q} \alpha_i x_i, \qquad (3)$$

then $x^r = PX^r$ can be constructed without knowing $X^r$, using the $\alpha_i$ of
equation (3) and the given $x_i^r = PX_i^r$ of the other objects:

$$x^r = \sum_{i=1}^{q} \alpha_i x_i^r. \qquad (4)$$

These relations suggest that we can use "prototypical" 2D views (the projections of a
basis of a linear object class) and their known transformations to synthesize an operator
that will transform a 2D view into a new 2D view when the object is a linear combination
of the prototypes. In other words we can compute a new 2D view of such an object without
knowing explicitly its three-dimensional structure. Notice also that knowledge of the
correspondence between equation (3) and equation (4) is not necessary (rows in a linear
equation system can be exchanged freely). Therefore, the technique does not require
computing the correspondence between views from different viewpoints. In fact, some points
may be occluded. Figure 1 shows a very simple example of a linear object class and the
construction of a new view of an object. Taking the 8 corners of a cuboid as features, a
3D view X, as defined above, is an element of $\mathbb{R}^{24}$; however, the dimension of
the class of all cuboids is only 3, so any cuboid can be represented as a linear
combination of three cuboids. For any projection that preserves these 3 dimensions, we can
apply equations (3) and (4). The projection in figure 1 projects all non-occluded corners
orthographically onto the image plane ($x = PX \in \mathbb{R}^{p}$), preserving the
dimensionality. Notice that the orthographic projection of an exactly frontal view of a
cuboid, which would result in a rectangle as image, would preserve 2 dimensions only, so
equation (4) could not guarantee the correct result.
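
The cuboid example can be reproduced in a few lines. This is a sketch under stated assumptions rather than the procedure used for figure 1: the cuboid sizes, the two orientations and the helper names (`cuboid_corners`, `rotation`, `view`) are hypothetical, and for simplicity all 8 corners are kept in both views instead of dropping occluded ones.

```python
import numpy as np

def cuboid_corners(w, h, d):
    """8 corners of a cuboid centred at the origin, as an (8, 3) array."""
    signs = np.array([[sx, sy, sz] for sx in (-1, 1)
                      for sy in (-1, 1) for sz in (-1, 1)], dtype=float)
    return 0.5 * signs * np.array([w, h, d])

def rotation(yaw, pitch):
    """Rotation about the y-axis (yaw) followed by the x-axis (pitch)."""
    cy, sy, cp, sp = np.cos(yaw), np.sin(yaw), np.cos(pitch), np.sin(pitch)
    r_y = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    r_x = np.array([[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]])
    return r_x @ r_y

def view(corners, rot):
    """Orthographic 2D view: rotate, drop z, flatten to a shape vector."""
    return (corners @ rot.T)[:, :2].ravel()

examples = [cuboid_corners(1, 2, 3), cuboid_corners(2, 1, 1), cuboid_corners(1, 1, 2)]
test = cuboid_corners(1.5, 2.5, 0.8)
rot_a, rot_b = rotation(0.4, 0.3), rotation(-0.5, 0.2)

# Equation (3): decompose the test view into the example views (orientation A).
basis_a = np.column_stack([view(c, rot_a) for c in examples])
alpha, *_ = np.linalg.lstsq(basis_a, view(test, rot_a), rcond=None)

# Equation (4): reuse the coefficients on the example views in orientation B.
basis_b = np.column_stack([view(c, rot_b) for c in examples])
synthesized = basis_b @ alpha

print(np.allclose(synthesized, view(test, rot_b)))
```

Replacing `rot_a` by the frontal view `rotation(0.0, 0.0)` makes the three example views linearly dependent, because the depth of the cuboid no longer affects the image; this is the degenerate case noted above.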

Before applying this idea to grey-level images, we would like to introduce a helpful
change of coordinate systems in equations (3) and (4). Instead of using an absolute
coordinate system, we represent the views as the difference to the view of a reference
object of the same class, in terms of the spatial differences of corresponding feature
points in the images. Subtracting on both sides of equations (3) and (4) the projection of
a reference object gives us

$$\Delta x = \sum_{i=1}^{q} \alpha_i \Delta x_i \qquad (5)$$

and

$$\Delta x^r = \sum_{i=1}^{q} \alpha_i \Delta x_i^r. \qquad (6)$$

After this change in the coordinate system, equation (6) now evaluates to the new
difference vector to the rotated reference view. The new view of the object can be
constructed by adding this difference to the rotated reference view.

2.2  Texture  of  3D  objects 

In this section we extend our linear space model from a representation based on feature
points to full images of objects. In the following we assume that the objects are
isolated, that is properly segmented from the background. To apply equations (5) and (6)
to images, the difference vectors between an image of a reference object and the images of
the other objects have to be computed. Since the difference vectors reflect the spatial
difference of corresponding pixels in images, this correspondence has to be computed
first. The problem of finding correspondence between images in general is difficult and
outside the scope of this paper. In the following we assume that the correspondence is
given for every pixel in the image. In our implementation (see next section) we
approximated these correspondence fields using a standard optical flow technique. For an
image of n-by-n pixels, the $\Delta x$ in equations (5) and (6) are the correspondence
fields of the images to a reference image, with $\Delta x \in \mathbb{R}^{2n^2}$.

The computed correspondence between images enables a representation of the image that
separates 2D-shape and texture information. The 2D-shape of an image is coded as a vector
representing the deformation field relative to a reference image. The texture information
is coded in terms of a vector which holds for each pixel the texture map that results from
mapping the image onto the reference image through the deformation field. In this
representation, all images - the shape vector and the texture vector - are vectorized
relative to the reference image. Since the texture or image irradiance of an object is in
general a complex function of albedo, surface orientation and the direction of
illumination, we have to distinguish different situations.

Let us first consider the easy case of objects all with the same identical texture:
corresponding pixels in each image have the same intensity or color. In this situation a
single texture map (e.g. the reference image) is sufficient. Assuming a linear object
class as described earlier, the shape coefficients $\alpha_i$ can be computed (equation
(5)) and result (equation (6)) in the correspondence field from the reference image in the
second orientation to the new 'virtual' image. To render the 'virtual' image, the
reference image has to be warped along this correspondence field. In other words the
reference image must be mapped onto the image locations given through the correspondence
field. In Figure 2 the method is applied to grey level images of three-dimensional
computer graphic models of five dog-like objects. The 'dogs' are shown in two orientations
and four examples of this transformation from one orientation to the other are given. Only
a single test view of a different dog is given. In each orientation, the correspondence
from a chosen reference image (dashed box) to the other images is computed separately (see
also section 'An implementation'). Since the dogs were created in such a way that the
three-dimensional objects form a linear object class, the correspondence field to the test
image could be decomposed exactly into the other fields (upper row). Applying the
coefficients of this decomposition to the correspondence fields of the second orientation
results in the correspondence of the reference image to a new image, showing the test
object in the second orientation. This new image ("output" in the lower row) was created
by simply warping the reference image along this correspondence field, since all objects
had the same texture. Since in this test a three-dimensional model of the object was
available, the synthesized output could be compared to the model. As shown in the
difference image, there is only a small error, which can be attributed to minor errors in
the correspondence step. This example shows that the method combined with standard image
matching algorithms is able to transform an image in a way that shows an object from a new
viewpoint.

Figure 2: Grey level images of an artificial linear object class are rendered. The
correspondence between the images of a reference object (dashed box) and the other
examples is computed separately for each orientation. The correspondence field between the
test image and the reference image is computed and linearly decomposed into the other
fields (upper row). A new correspondence field is synthesized applying the coefficients
from this decomposition to the fields from the reference image to the examples in the
lower row. The output is generated by forward warping the reference image along this new
correspondence field. In the difference image between the new image and the image of the
true 3D model (lower row, right), the missing parts are marked white whereas the parts not
existing in an image of the model are in black.

Let us next consider the situation in which the texture is a function of albedo only, that
is independent of the surface normal. Then a linear texture class can be formulated in a
way equivalent to equations (1) through (4). This is possible since the textures of all
objects were mapped along the computed deformation fields onto the reference image, so all
corresponding pixels in the images are mapped to the same pixel location in the reference
image. The equation

$$t = \sum_{i=1}^{q} \beta_i t_i \qquad (7)$$

with $\beta_i$ (different from $\alpha_i$ in equation (3)) implies

$$t^r = \sum_{i=1}^{q} \beta_i t_i^r \qquad (8)$$

assuming that the appearance of the texture is independent of the surface orientation and
the projection does not change the dimensionality of the texture space. Here we are in the
nice situation of a separate shape and texture space. In an application the coefficients
$\alpha_i$ for the shape and coefficients $\beta_i$ for the texture can be computed
separately. In face recognition experiments [5] the coefficients $\beta_i$ were already
used to generate a new texture of a face using textures of different faces. Figure 3 shows
a test of this linear approach for a separated 2D-shape and texture space in combination
with the approximated correspondence. Three example faces are shown, each from two
different viewpoints corresponding to a rotation of 22.5°. Since the class of all faces
has more than three dimensions, a synthetic face image is used to test the method. This
synthetic face is generated by a standard morphing technique [1] between the two upper
left images. This ensures that the necessary requirements for the linear class assumption
hold, that is the test image is a linear combination of the example images in texture and
2D-shape. In the first step, for each orientation the correspondence between a reference
face (dashed box) and the other faces is computed. Using the same procedure described
earlier, the correspondence field to the test image is decomposed into the other fields,
evaluating the coefficients $\alpha_i$. Differently from figure 2, the textures are mapped
onto the reference face. Now the texture of the test face can be linearly decomposed into
the textures of the example faces. Applying the resulting coefficients $\beta_i$ to the
textures of the example faces in the second orientation (lower row of figure 3), we
generate a new texture mapped onto the reference face. This new texture is now warped
along the new correspondence field. This new field is evaluated by applying the
coefficients $\alpha_i$ to the correspondence fields of the examples to the reference face
in the second orientation. The output of this procedure is shown below the test image.
Since the input is synthetic, this result cannot be compared to the true rotated face, so
it is up to the observer to judge the quality of the applied transformation of the test
image.

Figure 3: Three human example faces are shown, each in two orientations (the three left
columns); one of these faces is used as reference face (dashed box). A synthetic face, a
'morph' between the two upper left images, is used as a test face to ensure the linear
combination constraint (upper right). The procedure of decomposing and synthesizing the
correspondence fields is as described in figure 2. Additionally all textures, for each
orientation separately, are mapped onto the reference face. Here the test texture is
decomposed into the other example textures. Using the evaluated coefficients a new texture
is synthesized for the second orientation on the reference face. The final output, the
transformed test face, is generated by warping this new texture along the new synthesized
correspondence field.

There is a third case to consider. When the texture is a function of the surface normal n
at each point, then the situation is more restricted. Equation (7) becomes:

$$t(n) = \sum_{i=1}^{q} \beta_i t_i(n_i). \qquad (9)$$

On the other hand, equation (2) implies $n^r = \sum_{i=1}^{q} \alpha_i n_i^r$. Now
equation (8) becomes

$$t^r\Big(\sum_{i=1}^{q} \alpha_i n_i^r\Big) = \sum_{i=1}^{q} \beta_i t_i^r(n_i^r). \qquad (10)$$

This condition limits the freedom of the possible textures. In the case of Lambertian
surfaces with a constant light source the texture is a linear function of the surface
normal n and equation (10) can be solved with $\beta_i = \alpha_i$. In this case equations
(5) and (9) can be solved with $\beta_i = \alpha_i$ to ensure the correct result in
equation (10).
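
The Lambertian case can be made explicit with a short worked equation. As an illustrative assumption that is not stated in the text, suppose that all objects share the same albedo $\rho$ at corresponding points and are lit from a fixed direction $l$, without attached shadows, so that $t(n) = \rho\, l^\top n$. Setting $\beta_i = \alpha_i$ then satisfies equation (10):

```latex
t^r\Big(\sum_{i=1}^{q} \alpha_i n_i^r\Big)
  = \rho\, l^\top \sum_{i=1}^{q} \alpha_i n_i^r
  = \sum_{i=1}^{q} \alpha_i \,\rho\, l^\top n_i^r
  = \sum_{i=1}^{q} \alpha_i \, t_i^r(n_i^r)
```

The argument uses only the linearity of $t$ in $n$; it does not carry over to reflectance models in which the texture depends non-linearly on the surface normal.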

3  An  Implementation 

The implementation of this method for grey-level pixel images can be divided into three
steps. First, the correspondence between the images of the objects has to be computed.
Second, the correspondence field to the new image has to be linearly decomposed into the
correspondence fields of the examples. The same decomposition has to be carried out for
the new texture in terms of the example textures. And finally we synthesize the new image,
showing the object from the new viewpoint.

3.1  Computation  of  the  Correspondence 

To compute the differences $\Delta x$ used in equations (5) and (6), which are the spatial
distances between corresponding points of the objects in the images, the correspondence of
these points has to be established first. That means we have to find for every pixel
location in an image, e.g. a pixel located on the nose, the corresponding pixel location
on the nose in the other image. This is in general a hard problem. However, since all
objects compared here are in the same orientation, we can often assume that the images are
quite similar and that occlusion problems should usually be negligible. These conditions
make it feasible to compare the images of the different objects with automatic techniques.
Such algorithms are known from optical flow computation, in which points have to be
tracked from one image to the other. We use a coarse-to-fine gradient-based method [2] and
follow an implementation described in [3]. For every point x,y in an image I, the error
term $E = \sum (I_x \delta x + I_y \delta y - \delta I)^2$ is minimized for $\delta x$,
$\delta y$, with $I_x$, $I_y$ being the spatial image derivatives and $\delta I$ the
difference of intensity of the two compared images. The coarse-to-fine strategy refines
the computed displacements when finer levels are processed. The final result of this
computation $(\delta x, \delta y)$ is used as an approximation of the spatial displacement
($\Delta x$ in equations (5) and (6)) of a pixel from one image to the other. The
correspondence is computed in the direction towards the reference image from the example
and the test images. As a consequence all vector fields have a common origin at the pixel
locations of the reference image.
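
The gradient-based error term can be illustrated on a toy problem. The sketch below is not the coarse-to-fine method of [2, 3]: it is a single-level simplification that assumes one global displacement for the whole image, and the function names, the synthetic test pattern and the sign convention ($\delta I$ taken as first image minus second image) are assumptions made for the example.

```python
import numpy as np

def estimate_displacement(img_a, img_b):
    """Least-squares estimate of one global displacement (dx, dy).

    Minimizes sum (I_x*dx + I_y*dy - dI)^2 over all pixels, where I_x, I_y are
    the spatial derivatives of img_a and dI = img_a - img_b.
    """
    grad_y, grad_x = np.gradient(img_a)          # derivatives along rows, columns
    delta = img_a - img_b                        # intensity difference dI
    A = np.column_stack([grad_x.ravel(), grad_y.ravel()])
    solution, *_ = np.linalg.lstsq(A, delta.ravel(), rcond=None)
    return solution                              # array([dx, dy])

def pattern(x, y):
    """Smooth synthetic test image (hypothetical, for illustration only)."""
    return np.sin(0.2 * x) + np.cos(0.15 * y)

ys, xs = np.mgrid[0:64, 0:64].astype(float)
true_shift = np.array([0.3, -0.2])
img_a = pattern(xs, ys)
img_b = pattern(xs - true_shift[0], ys - true_shift[1])   # content moved by true_shift

estimated = estimate_displacement(img_a, img_b)
print(estimated)
```

The linearization behind the error term is only valid for small displacements, which is the reason for the coarse-to-fine strategy: large displacements become small at coarse resolution and are refined at finer levels.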

3.2  Learning  the  Linear  Transformation 

The decomposition of a given correspondence field in equation (5) and the composition of
the new field in equation (6) can be understood as a single linear transformation. First,
we compute the coefficients $\alpha_i$ for the optimal decomposition (in the sense of
least squares). We decompose an "initial" field $\Delta x$ to a new object X into the
"initial" fields $\Delta x_i$ to the q given prototypes by minimizing

$$\Big\| \Delta x - \sum_{i=1}^{q} \alpha_i \Delta x_i \Big\|^2. \qquad (11)$$

We rewrite equation (5) as $\Delta x = \Phi\alpha$ where $\Phi$ is the matrix formed by
the q vectors $\Delta x_i$ arranged column-wise and $\alpha$ is the column vector of the
$\alpha_i$ coefficients. Minimizing equation (11) gives

$$\alpha = \Phi^{+} \Delta x. \qquad (12)$$

The observation of the previous section implies that the operator L that transforms
$\Delta x$ into $\Delta x^r$ through $\Delta x^r = L \Delta x$ is given by

$$\Delta x^r = \Phi^r \alpha = \Phi^r \Phi^{+} \Delta x \quad \text{as} \quad L = \Phi^r \Phi^{+} \qquad (13)$$

and thus can be learned from the 2D example pairs $(\Delta x_i, \Delta x_i^r)$. In this
case, a one-layer, linear network (compare Hurlbert and Poggio, 1988) can be used to learn
the transformation L. L can then transform a view of a novel object of the same class. If
the q examples are linearly independent, $\Phi^{+}$ is given by
$\Phi^{+} = (\Phi^T \Phi)^{-1} \Phi^T$; in the other cases equation (11) was solved by an
SVD algorithm.
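
Equations (11) to (13) translate almost directly into matrix code. The following is a minimal sketch on random data, assuming that the prototype fields are already available as columns of two matrices; the sizes, the variable names and the random fields are illustrative and do not come from the experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, q = 200, 4                        # field length and number of prototypes (assumed)

# Columns are the prototype fields in the first and in the rotated orientation.
phi = rng.normal(size=(dim, q))        # Phi
phi_r = rng.normal(size=(dim, q))      # Phi^r

# Equation (13): L = Phi^r Phi^+ (np.linalg.pinv computes Phi^+ via an SVD).
phi_pinv = np.linalg.pinv(phi)
L = phi_r @ phi_pinv

# A novel object inside the linear class, with known ground truth.
alpha_true = np.array([0.3, -0.8, 1.1, 0.4])
dx = phi @ alpha_true
dx_r_true = phi_r @ alpha_true

alpha = phi_pinv @ dx                  # equation (12)
dx_r = L @ dx                          # equation (13)

# For linearly independent columns the pseudo-inverse has the closed form.
closed_form = np.linalg.inv(phi.T @ phi) @ phi.T
print(np.allclose(alpha, alpha_true), np.allclose(dx_r, dx_r_true),
      np.allclose(phi_pinv, closed_form))
```

In practice L need not be formed explicitly: for fields with many pixels it is cheaper to compute the q coefficients with equation (12) and then apply them to the q rotated prototype fields.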

Before decomposing the new texture into the example textures, all textures have to be
mapped onto a common basis. Using the correspondence, we warped all images onto the
reference image. In this representation the decomposition of the texture can be performed
as described above for the correspondence fields.

3.3  Synthesis  of  the  New  Image

The final step is image rendering. Applying the computed
coefficients to the examples in the second orientation
yields a new texture and the correspondence field to the
new image. The new image can be generated by combining
this texture and correspondence field. This is possible
because both are given in the coordinates of the reference
image: for every pixel in the reference image, the pixel
value and the vector pointing to the new location are
given. The new location generally does not coincide with
the equally spaced grid of pixels of the destination image.
A commonly used solution to this problem is known as
forward warping [19]. For every new pixel, we use the
nearest three points to linearly approximate the pixel
intensity.
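
The rendering step can be illustrated with a small sketch. It assumes a brute-force nearest-point search and a plane fitted through the three nearest warped points; the function name `forward_warp`, the (dy, dx) flow layout and the fallback for collinear points are choices made for this example and are not taken from [19].

```python
import numpy as np

def forward_warp(texture, flow, out_shape):
    """Forward-warp `texture` (H, W) along `flow` (H, W, 2), stored as (dy, dx).

    Every reference pixel is moved to its new, generally off-grid location.
    Each destination pixel is then approximated linearly from the three
    nearest moved points. Brute-force search: suitable for small images only.
    """
    h, w = texture.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([ys + flow[..., 0], xs + flow[..., 1]], axis=-1).reshape(-1, 2)
    vals = texture.reshape(-1).astype(float)

    out = np.zeros(out_shape)
    for i in range(out_shape[0]):
        for j in range(out_shape[1]):
            d2 = (pts[:, 0] - i) ** 2 + (pts[:, 1] - j) ** 2
            idx = np.argsort(d2)[:3]                  # three nearest points
            # Plane v = a + b*y + c*x through the three points.
            A = np.column_stack([np.ones(3), pts[idx]])
            if abs(np.linalg.det(A)) < 1e-9:
                out[i, j] = vals[idx[0]]              # collinear: nearest value
            else:
                a, b, c = np.linalg.solve(A, vals[idx])
                out[i, j] = a + b * i + c * j
    return out

# Usage: a linear intensity ramp shifted by a constant sub-pixel flow.
ys, xs = np.mgrid[0:5, 0:6]
ramp = 2.0 * ys + 3.0 * xs
flow = np.zeros((5, 6, 2))
flow[..., 0] = 0.5     # shift down by half a pixel
flow[..., 1] = 0.25    # shift right by a quarter pixel
warped = forward_warp(ramp, flow, ramp.shape)
```

Because the ramp is itself linear, the plane through any three non-collinear points reproduces it exactly. Near the image border the three nearest points can be collinear, in which case this sketch falls back to the nearest value.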

4  Is  the  linear  class  assumption  valid 
for  real  objects? 

For man-made objects, which often consist of cuboids,
cylinders or other geometric primitives, the assumption
of linear object classes seems almost natural. However,
are there other object classes which can be linearly
represented by a finite set of example objects? In the case
of faces it is not clear how many example faces are
necessary to synthesize any other face, and in fact it is
unclear whether the assumption of a linear class is
appropriate at all. The key test for the linear class
hypothesis in this case is how well the synthesized rotated
face approximates the “true” rotated face. We tested our
approach on a small set of 50 faces, each given in two
orientations (22.5° and 0°). Figure 4 shows four tests using
the same technique as described in figure 3. In each case
one face was selected as the test face and the 49 remaining
faces were used as examples. Each test face is shown on the
upper left and the output image produced by our technique,
showing a rotated test face, on the lower right. The true
rotated test face from the database is shown on the lower
left. We also show in the upper right the synthesis of
the test face through the 49 example faces in the test
orientation. This reconstruction of the test face should
be understood as the projection of the test face into the
shape and texture space of the other 49 example faces.
A perfect reconstruction of the test face would be a
necessary (not sufficient!) requirement for the 50 faces
to form a linear object class. The results are not perfect
but, considering the small size of the example set, the
reconstruction is quite good. The similarity of the
reconstruction to the input test face allows us to speculate
that an example set on the order of a hundred faces
may be sufficient to construct a huge variety of different
faces. We conclude that the linear object class approach
may be a satisfactory approximation even for objects as
complex as faces. On the other hand, it is obvious that
the reconstruction of every specific mole or wrinkle in a
face requires an almost infinite number of examples.
To overcome this problem, correspondence between images
taken from different viewpoints should be used to
map the specific texture onto the new orientation [9, 5].

5  Discussion 

Linear combinations of images of a single object have
already been used successfully to create a new image of
that object [16]. Here we created a new image of an
object using linear combinations of images of different
objects of the same class. Given only a single image of
an object, we are able to generate additional synthetic
images of this object under the assumption that the
“linear class” property holds. This is demonstrated not only
for objects purely defined through their shape but also
for smooth objects with texture.

Figure 4: Four examples of artificially rotated human faces, using the technique described in figure 3.
Each test face (upper left) is “rotated” using 49 different faces (not shown) as examples; the results are marked as
output. For comparison only, the “true” rotated test face is shown on the lower left (this face was not used in the
computation). The difference between the synthetic and the real rotated face is due to the incomplete example set, since the
same difference can already be seen in the reconstruction of the input test face using the example faces (upper
right).

This approach based on two-dimensional models does
not need any depth information, so the sometimes difficult
step of generating three-dimensional models from
two-dimensional images is superfluous. Since no
correspondence is necessary between images representing
objects in different orientations, fully automated
algorithms can be applied for the correspondence-finding
step. For object recognition tasks our approach has several
implications. Our technique can provide additional
artificial example images of an object when only a single
image is given. On the other hand, the coefficients
which result from a decomposition of shape and texture
into example shapes and textures already give us a
representation of the object which is invariant under any
affine transformation.

In an application our approach is confronted with two
types of problems. As in any approach based on flexible
models, there is the problem of finding the correspondence
between model and image. In our implementation
we used a general method for finding this correspondence.
However, if the class of objects is known in
advance, a method specific to this object class could be
used [9, 7]. In this case the correspondence field is
linearly modeled by a known set of deformations specific to
that class of objects.

A second problem, specific to our approach, is the
existence of linear object classes and the completeness of
the available examples. This is equivalent to the question
of whether object classes defined in terms of human
perception can be modeled through linear object classes.
Presently there is no final answer to this question, apart
from simple objects (e.g. cuboids, cylinders), where
the dimensionality is given through their mathematical
definition. The application of the method to a small
example set of human faces, shown here, provides
promising preliminary results, at least for some faces. It is,
however, clear that 50 example faces are not sufficient
to model all human faces accurately. Since our linear
model allows us to test the necessary conditions for an
image being a member of a linear object class, the model
can detect images where a transformation fails. This test
can be done by measuring the difference between the
input image and its projection into the example space,
which should ideally vanish.
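
Such a membership test can be sketched as below. The relative residual, the function name `projection_residual` and the random stand-in data are assumptions of this illustration. In practice the vectors would be the shape or texture vectors of the examples, and a suitable acceptance threshold would have to be chosen empirically.

```python
import numpy as np

def projection_residual(phi, v):
    """Relative distance between v and its least-squares projection
    onto the space spanned by the columns of phi (the examples)."""
    alpha, *_ = np.linalg.lstsq(phi, v, rcond=None)
    return np.linalg.norm(v - phi @ alpha) / np.linalg.norm(v)

rng = np.random.default_rng(1)
phi = rng.standard_normal((200, 49))       # 49 hypothetical example vectors
inside = phi @ rng.standard_normal(49)     # lies in the example space
outside = rng.standard_normal(200)         # generic vector, mostly outside it

r_in = projection_residual(phi, inside)    # vanishes up to rounding error
r_out = projection_residual(phi, outside)  # clearly non-zero
```

A large residual indicates that the necessary condition is violated and that the transformed image should not be trusted. A small residual does not by itself guarantee a correct transformation.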

Our implementation, as described in our examples, can
be improved by applying the linear class idea to
independent parts of the objects. In the face case, a new
input face was linearly approximated through the complete
example faces; that is, for each example face a single
coefficient (for texture and 2D shape separately) was
computed. If noses, mouths or eyes span separate linear
subspaces, then the dimensionality of the space spanned
by the examples will be multiplied by the number of
subspaces. In a new image the different parts will then
be approximated separately by the examples, which will
increase the number of coefficients used as representation
and will also improve the reconstruction.
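
The gain from part-wise approximation can be illustrated on synthetic data. The sketch assumes that the part boundaries are known in advance and uses three equally sized, made-up parts; the names `fit_residual` and `parts` are hypothetical.

```python
import numpy as np

def fit_residual(phi, v):
    """Norm of the least-squares residual of v with respect to the columns of phi."""
    alpha, *_ = np.linalg.lstsq(phi, v, rcond=None)
    return np.linalg.norm(v - phi @ alpha)

rng = np.random.default_rng(2)
n_pix, q = 90, 5
phi = rng.standard_normal((n_pix, q))                  # 5 example vectors
parts = [slice(0, 30), slice(30, 60), slice(60, 90)]   # three assumed parts

# Novel object: every part is a different mixture of the example parts.
v = np.concatenate([phi[p] @ rng.standard_normal(q) for p in parts])

# One coefficient per example for the whole object (q coefficients) ...
whole = fit_residual(phi, v)
# ... versus one coefficient per example and part (3 * q coefficients).
by_part = np.sqrt(sum(fit_residual(phi[p], v[p]) ** 2 for p in parts))
```

The part-wise residual can never exceed the residual of the whole-object fit, since the whole-object coefficients are also admissible for every part. In this constructed case it vanishes, while the whole-object fit does not.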


Several open questions remain for a fully automated
implementation. The separation of parts of an object to
form separate subspaces could be done by computing
the covariance between the pixels of the example images.
However, for images at high resolution, this may need
thousands of example images. Our linear object class
approach also assumes that the orientation of an object
in an image is known. The orientation of faces can be
approximated by computing the correlation of a new image
to templates of faces in various orientations [4]. It is not
clear how precisely the orientation should be estimated
to yield satisfactory results.

Appendix 

A  Decomposing  objects  into  parts 

In the previous section we considered learning the
appropriate transformation from full views. In this case,
the examples (prototypes) must have the same
dimensionality as a full view. Our arguments above show that
dimensionality determines the number of example pairs
needed for a correct transformation. This section suggests
that components of an object (i.e. a subset of
the full set of features) that are elements of the same
object class may be used to learn a single transformation
with a reduced number of examples, because of the
smaller dimensionality of each component. We rewrite
equation (1) as $X = \Phi\alpha$, where $\Phi$ is the matrix formed
by the q vectors $X_i$ arranged column-wise and $\alpha$ is the
column vector of the coefficients $\alpha_i$. The basic
components into which a view can be decomposed are given by
the irreducible submatrices $\Phi^{(i)}$ of the structure matrix
$\Phi$, so that $\Phi = \Phi^{(1)} \oplus \dots \oplus \Phi^{(k)}$. Each submatrix
$\Phi^{(i)}$ represents an isolated object class, formed by a subset
of feature points which we would like to call a part of
an object. As an example, for objects composed of two
cuboids, in general six examples would be necessary, since
all 3D views of objects composed of two cuboids span a
six-dimensional space (we suppose a fixed angle between
the cuboids). However, this space $\Phi$ is the direct sum
$\Phi = \Phi^{(1)} \oplus \Phi^{(2)}$ of two three-dimensional subspaces, so
three examples are sufficient. Notice that $\Phi^{(1)}$ and $\Phi^{(2)}$
are only identical when both are in the same orientation.
This shows that the problem of transforming the
2D view $x$ of the 3D object $X$ into the transformed 2D
view $x^r$ can be treated separately for each component
$x^{(k)}$.
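
As a worked illustration of the direct-sum argument, the two-cuboid case can be written out explicitly. This is a sketch under the assumption that each object vector is stacked part by part and that the part boundaries are known. The per-part coefficient vectors $\alpha^{(k)}$, the 2D example matrices $\Psi^{(k)}$ and $\Psi^{r(k)}$ of part k in the two orientations, and the per-part operators $L^{(k)}$ are notation introduced here for illustration only.

```latex
\begin{aligned}
X &= \begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix}
   = \begin{pmatrix} \Phi^{(1)} & 0 \\ 0 & \Phi^{(2)} \end{pmatrix}
     \begin{pmatrix} \alpha^{(1)} \\ \alpha^{(2)} \end{pmatrix},
   \qquad \alpha^{(1)}, \alpha^{(2)} \in \mathbb{R}^{3}, \\
x^{r(k)} &= L^{(k)} x^{(k)}, \qquad
L^{(k)} = \Psi^{r(k)} \bigl(\Psi^{(k)}\bigr)^{+}, \qquad k = 1, 2 .
\end{aligned}
```

Three example objects supply the three columns of each block, so the six free coefficients $(\alpha^{(1)}, \alpha^{(2)})$ are determined part by part, and each $L^{(k)}$ has the same form as the operator in equation (13).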
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